{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a href=\"https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/vector_stores/VertexAIVectorSearchDemo.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Google Vertex AI Vector Search\n",
    "\n",
    "This notebook shows how to use functionality related to the `Google Cloud Vertex AI Vector Search` vector database.\n",
    "\n",
    "> [Google Vertex AI Vector Search](https://cloud.google.com/vertex-ai/docs/vector-search/overview), formerly known as Vertex AI Matching Engine, provides the industry's leading high-scale low latency vector database. These vector databases are commonly referred to as vector similarity-matching or an approximate nearest neighbor (ANN) service.\n",
    "\n",
    "**Note**: LlamaIndex expects Vertex AI Vector Search endpoint and deployed index is already created. An empty index creation time take upto a minute and deploying an index to the endpoint can take upto 30 min.\n",
    "\n",
    "> To see how to create an index refer to the section [Create Index and deploy it to an Endpoint](#create-index-and-deploy-it-to-an-endpoint)  \n",
    "If you already have an index deployed , skip to [Create VectorStore from texts](#create-vector-store-from-texts)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Installation\n",
    "\n",
    "If you're opening this Notebook on colab, you will probably need to install LlamaIndex 🦙."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "! pip install llama-index llama-index-vector-stores-vertexaivectorsearch llama-index-llms-vertex"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create Index and deploy it to an Endpoint\n",
    "\n",
    "- This section demonstrates creating a new index and deploying it to an endpoint."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# TODO : Set values as per your requirements\n",
    "\n",
    "# Project and Storage Constants\n",
    "PROJECT_ID = \"[your_project_id]\"\n",
    "REGION = \"[your_region]\"\n",
    "GCS_BUCKET_NAME = \"[your_gcs_bucket]\"\n",
    "GCS_BUCKET_URI = f\"gs://{GCS_BUCKET_NAME}\"\n",
    "\n",
    "# The number of dimensions for the textembedding-gecko@003 is 768\n",
    "# If other embedder is used, the dimensions would probably need to change.\n",
    "VS_DIMENSIONS = 768\n",
    "\n",
    "# Vertex AI Vector Search Index configuration\n",
    "# parameter description here\n",
    "# https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform.MatchingEngineIndex#google_cloud_aiplatform_MatchingEngineIndex_create_tree_ah_index\n",
    "VS_INDEX_NAME = \"llamaindex-doc-index\"  # @param {type:\"string\"}\n",
    "VS_INDEX_ENDPOINT_NAME = \"llamaindex-doc-endpoint\"  # @param {type:\"string\"}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from google.cloud import aiplatform\n",
    "\n",
    "aiplatform.init(project=PROJECT_ID, location=REGION)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create Cloud Storage bucket"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create a bucket.\n",
    "! gsutil mb -l $REGION -p $PROJECT_ID $GCS_BUCKET_URI"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create an empty Index \n",
    "\n",
    "**Note :** While creating an index you should specify an \"index_update_method\" - `BATCH_UPDATE` or `STREAM_UPDATE`\n",
    "\n",
    "> A batch index is for when you want to update your index in a batch, with data which has been stored over a set amount of time, like systems which are processed weekly or monthly. \n",
    ">\n",
    "> A streaming index is when you want index data to be updated as new data is added to your datastore, for instance, if you have a bookstore and want to show new inventory online as soon as possible. \n",
    ">\n",
    "> Which type you choose is important, since setup and requirements are different.\n",
    "\n",
    "Refer [Official Documentation](https://cloud.google.com/vertex-ai/docs/vector-search/create-manage-index) and [API reference](https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform.MatchingEngineIndex#google_cloud_aiplatform_MatchingEngineIndex_create_tree_ah_index) for more details on configuring indexes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# NOTE : This operation can take upto 30 seconds\n",
    "\n",
    "# check if index exists\n",
    "index_names = [\n",
    "    index.resource_name\n",
    "    for index in aiplatform.MatchingEngineIndex.list(\n",
    "        filter=f\"display_name={VS_INDEX_NAME}\"\n",
    "    )\n",
    "]\n",
    "\n",
    "if len(index_names) == 0:\n",
    "    print(f\"Creating Vector Search index {VS_INDEX_NAME} ...\")\n",
    "    vs_index = aiplatform.MatchingEngineIndex.create_tree_ah_index(\n",
    "        display_name=VS_INDEX_NAME,\n",
    "        dimensions=VS_DIMENSIONS,\n",
    "        distance_measure_type=\"DOT_PRODUCT_DISTANCE\",\n",
    "        shard_size=\"SHARD_SIZE_SMALL\",\n",
    "        index_update_method=\"STREAM_UPDATE\",  # allowed values BATCH_UPDATE , STREAM_UPDATE\n",
    "    )\n",
    "    print(\n",
    "        f\"Vector Search index {vs_index.display_name} created with resource name {vs_index.resource_name}\"\n",
    "    )\n",
    "else:\n",
    "    vs_index = aiplatform.MatchingEngineIndex(index_name=index_names[0])\n",
    "    print(\n",
    "        f\"Vector Search index {vs_index.display_name} exists with resource name {vs_index.resource_name}\"\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create an Endpoint\n",
    "\n",
    "To use the index, you need to create an index endpoint. It works as a server instance accepting query requests for your index. An endpoint can be a [public endpoint](https://cloud.google.com/vertex-ai/docs/vector-search/deploy-index-public) or a [private endpoint](https://cloud.google.com/vertex-ai/docs/vector-search/deploy-index-vpc).\n",
    "\n",
    "Let's create a public endpoint."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "endpoint_names = [\n",
    "    endpoint.resource_name\n",
    "    for endpoint in aiplatform.MatchingEngineIndexEndpoint.list(\n",
    "        filter=f\"display_name={VS_INDEX_ENDPOINT_NAME}\"\n",
    "    )\n",
    "]\n",
    "\n",
    "if len(endpoint_names) == 0:\n",
    "    print(\n",
    "        f\"Creating Vector Search index endpoint {VS_INDEX_ENDPOINT_NAME} ...\"\n",
    "    )\n",
    "    vs_endpoint = aiplatform.MatchingEngineIndexEndpoint.create(\n",
    "        display_name=VS_INDEX_ENDPOINT_NAME, public_endpoint_enabled=True\n",
    "    )\n",
    "    print(\n",
    "        f\"Vector Search index endpoint {vs_endpoint.display_name} created with resource name {vs_endpoint.resource_name}\"\n",
    "    )\n",
    "else:\n",
    "    vs_endpoint = aiplatform.MatchingEngineIndexEndpoint(\n",
    "        index_endpoint_name=endpoint_names[0]\n",
    "    )\n",
    "    print(\n",
    "        f\"Vector Search index endpoint {vs_endpoint.display_name} exists with resource name {vs_endpoint.resource_name}\"\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Deploy Index to the Endpoint\n",
    "\n",
    "With the index endpoint, deploy the index by specifying a unique deployed index ID.\n",
    "\n",
    "**NOTE : This operation can take upto 30 minutes.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# check if endpoint exists\n",
    "index_endpoints = [\n",
    "    (deployed_index.index_endpoint, deployed_index.deployed_index_id)\n",
    "    for deployed_index in vs_index.deployed_indexes\n",
    "]\n",
    "\n",
    "if len(index_endpoints) == 0:\n",
    "    print(\n",
    "        f\"Deploying Vector Search index {vs_index.display_name} at endpoint {vs_endpoint.display_name} ...\"\n",
    "    )\n",
    "    vs_deployed_index = vs_endpoint.deploy_index(\n",
    "        index=vs_index,\n",
    "        deployed_index_id=VS_INDEX_NAME,\n",
    "        display_name=VS_INDEX_NAME,\n",
    "        machine_type=\"e2-standard-16\",\n",
    "        min_replica_count=1,\n",
    "        max_replica_count=1,\n",
    "    )\n",
    "    print(\n",
    "        f\"Vector Search index {vs_index.display_name} is deployed at endpoint {vs_deployed_index.display_name}\"\n",
    "    )\n",
    "else:\n",
    "    vs_deployed_index = aiplatform.MatchingEngineIndexEndpoint(\n",
    "        index_endpoint_name=index_endpoints[0][0]\n",
    "    )\n",
    "    print(\n",
    "        f\"Vector Search index {vs_index.display_name} is already deployed at endpoint {vs_deployed_index.display_name}\"\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create Vector Store from texts\n",
    "\n",
    "NOTE : If you have existing Vertex AI Vector Search Index and Endpoints, you can assign them using following code:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# TODO : replace 1234567890123456789 with your actual index ID\n",
    "vs_index = aiplatform.MatchingEngineIndex(index_name=\"1234567890123456789\")\n",
    "\n",
    "# TODO : replace 1234567890123456789 with your actual endpoint ID\n",
    "vs_endpoint = aiplatform.MatchingEngineIndexEndpoint(\n",
    "    index_endpoint_name=\"1234567890123456789\"\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# import modules needed\n",
    "from llama_index.core import (\n",
    "    StorageContext,\n",
    "    Settings,\n",
    "    VectorStoreIndex,\n",
    "    SimpleDirectoryReader,\n",
    ")\n",
    "from llama_index.core.schema import TextNode\n",
    "from llama_index.core.vector_stores.types import (\n",
    "    MetadataFilters,\n",
    "    MetadataFilter,\n",
    "    FilterOperator,\n",
    ")\n",
    "from llama_index.llms.vertex import Vertex\n",
    "from llama_index.embeddings.vertex import VertexTextEmbedding\n",
    "from llama_index.vector_stores.vertexaivectorsearch import VertexAIVectorStore"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create a simple vector store from plain text without metadata filters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# setup storage\n",
    "vector_store = VertexAIVectorStore(\n",
    "    project_id=PROJECT_ID,\n",
    "    region=REGION,\n",
    "    index_id=vs_index.resource_name,\n",
    "    endpoint_id=vs_endpoint.resource_name,\n",
    "    gcs_bucket_name=GCS_BUCKET_NAME,\n",
    ")\n",
    "\n",
    "# set storage context\n",
    "storage_context = StorageContext.from_defaults(vector_store=vector_store)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Use [Vertex AI Embeddings](https://github.com/run-llama/llama_index/tree/main/llama-index-integrations/embeddings/llama-index-embeddings-vertex) as the embeddings model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# configure embedding model\n",
    "embed_model = VertexTextEmbedding(\n",
    "    model_name=\"textembedding-gecko@003\",\n",
    "    project=PROJECT_ID,\n",
    "    location=REGION,\n",
    ")\n",
    "\n",
    "# setup the index/query process, ie the embedding model (and completion if used)\n",
    "Settings.embed_model = embed_model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Add vectors and mapped text chunks to your vectore store"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Input texts\n",
    "texts = [\n",
    "    \"The cat sat on\",\n",
    "    \"the mat.\",\n",
    "    \"I like to\",\n",
    "    \"eat pizza for\",\n",
    "    \"dinner.\",\n",
    "    \"The sun sets\",\n",
    "    \"in the west.\",\n",
    "]\n",
    "nodes = [\n",
    "    TextNode(text=text, embedding=embed_model.get_text_embedding(text))\n",
    "    for text in texts\n",
    "]\n",
    "\n",
    "vector_store.add(nodes)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Running a similarity search"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# define index from vector store\n",
    "index = VectorStoreIndex.from_vector_store(\n",
    "    vector_store=vector_store, embed_model=embed_model\n",
    ")\n",
    "retriever = index.as_retriever()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Score: 0.703 Text: eat pizza for\n",
      "Score: 0.626 Text: dinner.\n"
     ]
    }
   ],
   "source": [
    "response = retriever.retrieve(\"pizza\")\n",
    "for row in response:\n",
    "    print(f\"Score: {row.get_score():.3f} Text: {row.get_text()}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Add documents with metadata attributes and use filters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Input text with metadata\n",
    "records = [\n",
    "    {\n",
    "        \"description\": \"A versatile pair of dark-wash denim jeans.\"\n",
    "        \"Made from durable cotton with a classic straight-leg cut, these jeans\"\n",
    "        \" transition easily from casual days to dressier occasions.\",\n",
    "        \"price\": 65.00,\n",
    "        \"color\": \"blue\",\n",
    "        \"season\": [\"fall\", \"winter\", \"spring\"],\n",
    "    },\n",
    "    {\n",
    "        \"description\": \"A lightweight linen button-down shirt in a crisp white.\"\n",
    "        \" Perfect for keeping cool with breathable fabric and a relaxed fit.\",\n",
    "        \"price\": 34.99,\n",
    "        \"color\": \"white\",\n",
    "        \"season\": [\"summer\", \"spring\"],\n",
    "    },\n",
    "    {\n",
    "        \"description\": \"A soft, chunky knit sweater in a vibrant forest green. \"\n",
    "        \"The oversized fit and cozy wool blend make this ideal for staying warm \"\n",
    "        \"when the temperature drops.\",\n",
    "        \"price\": 89.99,\n",
    "        \"color\": \"green\",\n",
    "        \"season\": [\"fall\", \"winter\"],\n",
    "    },\n",
    "    {\n",
    "        \"description\": \"A classic crewneck t-shirt in a soft, heathered blue. \"\n",
    "        \"Made from comfortable cotton jersey, this t-shirt is a wardrobe essential \"\n",
    "        \"that works for every season.\",\n",
    "        \"price\": 19.99,\n",
    "        \"color\": \"blue\",\n",
    "        \"season\": [\"fall\", \"winter\", \"summer\", \"spring\"],\n",
    "    },\n",
    "    {\n",
    "        \"description\": \"A flowing midi-skirt in a delicate floral print. \"\n",
    "        \"Lightweight and airy, this skirt adds a touch of feminine style \"\n",
    "        \"to warmer days.\",\n",
    "        \"price\": 45.00,\n",
    "        \"color\": \"white\",\n",
    "        \"season\": [\"spring\", \"summer\"],\n",
    "    },\n",
    "]\n",
    "\n",
    "nodes = []\n",
    "for record in records:\n",
    "    text = record.pop(\"description\")\n",
    "    embedding = embed_model.get_text_embedding(text)\n",
    "    metadata = {**record}\n",
    "    nodes.append(TextNode(text=text, embedding=embedding, metadata=metadata))\n",
    "\n",
    "vector_store.add(nodes)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Running a similarity search with filters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# define index from vector store\n",
    "index = VectorStoreIndex.from_vector_store(\n",
    "    vector_store=vector_store, embed_model=embed_model\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Text: A pair of well-tailored dress pants in a neutral grey. Made from a wrinkle-resistant blend, these pants look sharp and professional for workwear or formal occasions.\n",
      "   Score: 0.669\n",
      "   Metadata: {'price': 69.99, 'color': 'grey', 'season': ['fall', 'winter', 'summer', 'spring']}\n",
      "Text: A pair of tailored black trousers in a comfortable stretch fabric. Perfect for work or dressier events, these trousers provide a sleek, polished look.\n",
      "   Score: 0.642\n",
      "   Metadata: {'price': 59.99, 'color': 'black', 'season': ['fall', 'winter', 'spring']}\n"
     ]
    }
   ],
   "source": [
    "# simple similarity search without filter\n",
    "retriever = index.as_retriever()\n",
    "response = retriever.retrieve(\"pants\")\n",
    "\n",
    "for row in response:\n",
    "    print(f\"Text: {row.get_text()}\")\n",
    "    print(f\"   Score: {row.get_score():.3f}\")\n",
    "    print(f\"   Metadata: {row.metadata}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Text: A versatile pair of dark-wash denim jeans.Made from durable cotton with a classic straight-leg cut, these jeans transition easily from casual days to dressier occasions.\n",
      "   Score: 0.704\n",
      "   Metadata: {'price': 65.0, 'color': 'blue', 'season': ['fall', 'winter', 'spring']}\n",
      "Text: A denim jacket with a faded wash and distressed details. This wardrobe staple adds a touch of effortless cool to any outfit.\n",
      "   Score: 0.667\n",
      "   Metadata: {'price': 79.99, 'color': 'blue', 'season': ['fall', 'spring', 'summer']}\n"
     ]
    }
   ],
   "source": [
    "# similarity search with text filter\n",
    "filters = MetadataFilters(filters=[MetadataFilter(key=\"color\", value=\"blue\")])\n",
    "retriever = index.as_retriever(filters=filters, similarity_top_k=3)\n",
    "response = retriever.retrieve(\"denims\")\n",
    "\n",
    "for row in response:\n",
    "    print(f\"Text: {row.get_text()}\")\n",
    "    print(f\"   Score: {row.get_score():.3f}\")\n",
    "    print(f\"   Metadata: {row.metadata}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Text: A denim jacket with a faded wash and distressed details. This wardrobe staple adds a touch of effortless cool to any outfit.\n",
      "   Score: 0.667\n",
      "   Metadata: {'price': 79.99, 'color': 'blue', 'season': ['fall', 'spring', 'summer']}\n"
     ]
    }
   ],
   "source": [
    "# similarity search with text and numeric filter\n",
    "filters = MetadataFilters(\n",
    "    filters=[\n",
    "        MetadataFilter(key=\"color\", value=\"blue\"),\n",
    "        MetadataFilter(key=\"price\", operator=FilterOperator.GT, value=70.0),\n",
    "    ]\n",
    ")\n",
    "retriever = index.as_retriever(filters=filters, similarity_top_k=3)\n",
    "response = retriever.retrieve(\"denims\")\n",
    "\n",
    "for row in response:\n",
    "    print(f\"Text: {row.get_text()}\")\n",
    "    print(f\"   Score: {row.get_score():.3f}\")\n",
    "    print(f\"   Metadata: {row.metadata}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Parse, Index and Query PDFs using Vertex AI Vector Search and Gemini Pro"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "E0501 00:56:50.842446801  266241 backup_poller.cc:127]                 Run client channel backup poller: UNKNOWN:pollset_work {created_time:\"2024-05-01T00:56:50.841935606+00:00\", children:[UNKNOWN:Bad file descriptor {created_time:\"2024-05-01T00:56:50.841810434+00:00\", errno:9, os_error:\"Bad file descriptor\", syscall:\"epoll_wait\"}]}\n",
      "--2024-05-01 00:56:52--  https://arxiv.org/pdf/1706.03762.pdf\n",
      "Resolving arxiv.org (arxiv.org)... 151.101.67.42, 151.101.195.42, 151.101.131.42, ...\n",
      "Connecting to arxiv.org (arxiv.org)|151.101.67.42|:443... connected.\n",
      "HTTP request sent, awaiting response... 301 Moved Permanently\n",
      "Location: http://arxiv.org/pdf/1706.03762 [following]\n",
      "--2024-05-01 00:56:52--  http://arxiv.org/pdf/1706.03762\n",
      "Connecting to arxiv.org (arxiv.org)|151.101.67.42|:80... connected.\n",
      "HTTP request sent, awaiting response... 200 OK\n",
      "Length: 2215244 (2.1M) [application/pdf]\n",
      "Saving to: ‘./data/arxiv/test.pdf’\n",
      "\n",
      "./data/arxiv/test.p 100%[===================>]   2.11M  --.-KB/s    in 0.07s   \n",
      "\n",
      "2024-05-01 00:56:52 (31.5 MB/s) - ‘./data/arxiv/test.pdf’ saved [2215244/2215244]\n",
      "\n"
     ]
    }
   ],
   "source": [
    "! mkdir -p ./data/arxiv/\n",
    "! wget 'https://arxiv.org/pdf/1706.03762.pdf' -O ./data/arxiv/test.pdf"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "# of documents = 15\n"
     ]
    }
   ],
   "source": [
    "# load documents\n",
    "documents = SimpleDirectoryReader(\"./data/arxiv/\").load_data()\n",
    "print(f\"# of documents = {len(documents)}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# setup storage\n",
    "vector_store = VertexAIVectorStore(\n",
    "    project_id=PROJECT_ID,\n",
    "    region=REGION,\n",
    "    index_id=vs_index.resource_name,\n",
    "    endpoint_id=vs_endpoint.resource_name,\n",
    "    gcs_bucket_name=GCS_BUCKET_NAME,\n",
    ")\n",
    "\n",
    "# set storage context\n",
    "storage_context = StorageContext.from_defaults(vector_store=vector_store)\n",
    "\n",
    "# configure embedding model\n",
    "embed_model = VertexTextEmbedding(\n",
    "    model_name=\"textembedding-gecko@003\",\n",
    "    project=PROJECT_ID,\n",
    "    location=REGION,\n",
    ")\n",
    "\n",
    "vertex_gemini = Vertex(\n",
    "    model=\"gemini-pro\",\n",
    "    context_window=100000,\n",
    "    temperature=0,\n",
    "    additional_kwargs={},\n",
    ")\n",
    "\n",
    "# setup the index/query process, ie the embedding model (and completion if used)\n",
    "Settings.llm = vertex_gemini\n",
    "Settings.embed_model = embed_model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# define index from vector store\n",
    "index = VectorStoreIndex.from_documents(\n",
    "    documents, storage_context=storage_context\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "query_engine = index.as_query_engine()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Response:\n",
      "--------------------------------------------------------------------------------\n",
      "The authors of the paper \"Attention Is All You Need\" are:\n",
      "\n",
      "* Ashish Vaswani\n",
      "* Noam Shazeer\n",
      "* Niki Parmar\n",
      "* Jakob Uszkoreit\n",
      "* Llion Jones\n",
      "* Aidan N. Gomez\n",
      "* Łukasz Kaiser\n",
      "* Illia Polosukhin\n",
      "--------------------------------------------------------------------------------\n",
      "Source Documents:\n",
      "--------------------------------------------------------------------------------\n",
      "Sample Text: Provided proper attribution is provided, Google he\n",
      "Relevance score: 0.720\n",
      "File Name: test.pdf\n",
      "Page #: 1\n",
      "File Path: /home/jupyter/llama_index/docs/examples/vector_stores/data/arxiv/test.pdf\n",
      "--------------------------------------------------------------------------------\n",
      "Sample Text: length nis smaller than the representation dimensi\n",
      "Relevance score: 0.678\n",
      "File Name: test.pdf\n",
      "Page #: 7\n",
      "File Path: /home/jupyter/llama_index/docs/examples/vector_stores/data/arxiv/test.pdf\n",
      "--------------------------------------------------------------------------------\n"
     ]
    }
   ],
   "source": [
    "response = query_engine.query(\n",
    "    \"who are the authors of paper Attention is All you need?\"\n",
    ")\n",
    "\n",
    "print(f\"Response:\")\n",
    "print(\"-\" * 80)\n",
    "print(response.response)\n",
    "print(\"-\" * 80)\n",
    "print(f\"Source Documents:\")\n",
    "print(\"-\" * 80)\n",
    "for source in response.source_nodes:\n",
    "    print(f\"Sample Text: {source.text[:50]}\")\n",
    "    print(f\"Relevance score: {source.get_score():.3f}\")\n",
    "    print(f\"File Name: {source.metadata.get('file_name')}\")\n",
    "    print(f\"Page #: {source.metadata.get('page_label')}\")\n",
    "    print(f\"File Path: {source.metadata.get('file_path')}\")\n",
    "    print(\"-\" * 80)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Clean Up\n",
    "\n",
    "Please delete Vertex AI Vector Search Index and Index Endpoint after running your experiments to avoid incurring additional charges. Please note that you will be charged as long as the endpoint is running.\n",
    "\n",
    "<div class=\"alert alert-block alert-warning\">\n",
    "    <b>⚠️ NOTE: Enabling `CLEANUP_RESOURCES` flag deletes Vector Search Index, Index Endpoint and Cloud Storage bucket. Please run it with caution.</b>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "CLEANUP_RESOURCES = False"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Undeploy indexes and Delete index endpoint"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "if CLEANUP_RESOURCES:\n",
    "    print(\n",
    "        f\"Undeploying all indexes and deleting the index endpoint {vs_endpoint.display_name}\"\n",
    "    )\n",
    "    vs_endpoint.undeploy_all()\n",
    "    vs_endpoint.delete()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Delete index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "if CLEANUP_RESOURCES:\n",
    "    print(f\"Deleting the index {vs_index.display_name}\")\n",
    "    vs_index.delete()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Delete contents from the Cloud Storage bucket"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "if CLEANUP_RESOURCES and \"GCS_BUCKET_NAME\" in globals():\n",
    "    print(f\"Deleting contents from the Cloud Storage bucket {GCS_BUCKET_NAME}\")\n",
    "\n",
    "    shell_output = ! gsutil du -ash gs://$GCS_BUCKET_NAME\n",
    "    print(shell_output)\n",
    "    print(\n",
    "        f\"Size of the bucket {GCS_BUCKET_NAME} before deleting = {' '.join(shell_output[0].split()[:2])}\"\n",
    "    )\n",
    "\n",
    "    # uncomment below line to delete contents of the bucket\n",
    "    # ! gsutil -m rm -r gs://$GCS_BUCKET_NAME"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "python",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
